{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "afd55886-5f5b-4794-838e-ef8179fb0394",
   "metadata": {},
   "source": [
    "##### **** These pip installs need to be adapted to use the appropriate release level. Alternatively, The venv running the jupyter lab could be pre-configured with a requirement file that includes the right release. Example for transform developers working from git clone:\n",
    "```\n",
    "make venv\n",
    "source venv/bin/activate && pip install jupyterlab\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "4c45c3c6-e4d7-4e61-8de6-32d61f2ce695",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture\n",
    "## This is here as a reference only\n",
    "# Users and application developers must use the right tag for the latest from pypi\n",
    "#!pip install data-prep-toolkit\n",
    "#!pip install data-prep-toolkit-transforms\n",
    "!pip install polars"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ebf1f782-0e61-485c-8670-81066beb734c",
   "metadata": {},
   "source": [
    "##### ***** Import required Classes and modules"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "bae63d15-4ce5-4f2a-a917-0f3161e9dd73",
   "metadata": {},
   "outputs": [],
   "source": [
    "from dpk_collapse.ray.runtime import Collapse"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7234563c-2924-4150-8a31-4aec98c1bf33",
   "metadata": {},
   "source": [
    "##### ***** Setup runtime parameters for this transform\n",
    "We will only provide a description for the parameters used in this example. For a complete list of parameters, please refer to the README.md for this transform:\n",
    "|parameter:type | value | description |\n",
    "|-|-|-|\n",
    "| input_folder:str | \\${PWD}/test-data/input/ | folder that contains the input parquet files for the collpase algorithm |\n",
    "| output_folder:str | \\${PWD}/test-data/output/ | folder that contains the all the intermediate results and the output parquet files for the collapse algorithm |\n",
    "| collapse_input_columns: list[str] | user defined | list of column names to join together into a single column |\n",
    "| collpase_output_column:str | user defined | column name that will be created to receive the joined text |\n",
    "| collpase_field_seperator:str | user defined | Seperator used for concatenated text |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "a54a78e9-d78b-4aeb-ac2b-806070a2dec0",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "10:48:14 INFO - parameters are : {'collapse_input_columns': ['title', 'contents'], 'collapse_output_column': 'text', 'collapse_field_seperator': '\\n', 'collapse_retain_all': None} at \"/Users/touma/data-prep-kit-pkg/transforms/universal/collapse/dpk_collapse/transform.py:139\"\n",
      "10:48:14 INFO - pipeline id pipeline_id\n",
      "10:48:14 INFO - code location None\n",
      "10:48:14 INFO - number of workers 1 worker options {'num_cpus': 0.8, 'max_restarts': -1}\n",
      "10:48:14 INFO - actor creation delay 0\n",
      "10:48:14 INFO - job details {'job category': 'preprocessing', 'job name': 'collapse', 'job type': 'ray', 'job id': 'job_id'}\n",
      "10:48:14 INFO - data factory data_ is using local data access: input_folder - test-data/input output_folder - test-data/output\n",
      "10:48:14 INFO - data factory data_ max_files -1, n_sample -1\n",
      "10:48:14 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n",
      "10:48:14 INFO - Running locally\n",
      "2025-04-19 10:48:15,216\tINFO worker.py:1777 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n",
      "\u001b[36m(orchestrate pid=61234)\u001b[0m 10:48:16 INFO - orchestrator started at 2025-04-19 10:48:16\n",
      "\u001b[36m(orchestrate pid=61234)\u001b[0m 10:48:16 INFO - Number of files is 1, source profile {'max_file_size': 0.034458160400390625, 'min_file_size': 0.034458160400390625, 'total_file_size': 0.034458160400390625}\n",
      "\u001b[36m(orchestrate pid=61234)\u001b[0m 10:48:16 INFO - Cluster resources: {'cpus': 12, 'gpus': 0, 'memory': 19.676039123907685, 'object_store': 2.0}\n",
      "\u001b[36m(orchestrate pid=61234)\u001b[0m 10:48:16 INFO - Number of workers - 1 with {'num_cpus': 0.8, 'max_restarts': -1} each\n",
      "\u001b[36m(RayTransformFileProcessor pid=61279)\u001b[0m 10:48:16 DEBUG - input columns: ['title', 'contents'] output column: text field seperator: '\n",
      "\u001b[36m(RayTransformFileProcessor pid=61279)\u001b[0m ' retain all: None  at \"/Users/touma/data-prep-kit-pkg/transforms/universal/collapse/dpk_collapse/transform.py:54\"\n",
      "\u001b[36m(orchestrate pid=61234)\u001b[0m 10:48:17 INFO - Completed 0 files (0.0%)  in 0.0 min. Waiting for completion\n",
      "\u001b[36m(orchestrate pid=61234)\u001b[0m 10:48:17 INFO - Completed processing 1 files in 0.004 min\n",
      "\u001b[36m(orchestrate pid=61234)\u001b[0m 10:48:17 INFO - done flushing in 0.001 sec\n",
      "\u001b[33m(raylet)\u001b[0m [2025-04-19 10:48:25,254 E 61228 7928441] (raylet) file_system_monitor.cc:111: /tmp/ray/session_2025-04-19_10-48-14_269914_61190 is over 95% full, available space: 15155265536; capacity: 494662586368. Object creation will fail if spilling is required.\n",
      "10:48:27 INFO - Completed execution in 0.223 min, execution result 0\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "0"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Collapse(\n",
    "    input_folder=\"test-data/input\",\n",
    "    output_folder=\"test-data/output\",\n",
    "    collapse_input_columns=[\"title\",\"contents\"],\n",
    "    collapse_output_column=\"text\",\n",
    "    run_locally=True,\n",
    ").transform()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c3df5adf-4717-4a03-864d-9151cd3f134b",
   "metadata": {},
   "source": [
    "##### **** The specified folder will include the transformed parquet files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "7276fe84-6512-4605-ab65-747351e13a7c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['test-data/output/sample1.parquet', 'test-data/output/metadata.json']"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import glob\n",
    "glob.glob(\"test-data/output/*\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d30489d9-fc98-423e-90a8-e8f372787e88",
   "metadata": {},
   "source": [
    "***** print the input data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "5b22234f-f7a1-4b92-b2ac-376b2545abce",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Unnamed: 0.1</th>\n",
       "      <th>Unnamed: 0</th>\n",
       "      <th>document_id</th>\n",
       "      <th>document</th>\n",
       "      <th>title</th>\n",
       "      <th>contents</th>\n",
       "      <th>language</th>\n",
       "      <th>doc_title</th>\n",
       "      <th>full_domain</th>\n",
       "      <th>filename</th>\n",
       "      <th>...</th>\n",
       "      <th>scope</th>\n",
       "      <th>tags</th>\n",
       "      <th>topics</th>\n",
       "      <th>contents_sha2</th>\n",
       "      <th>version</th>\n",
       "      <th>extras</th>\n",
       "      <th>dataset</th>\n",
       "      <th>domain</th>\n",
       "      <th>date_crawled</th>\n",
       "      <th>date_downloaded</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>d07817ed05f795177bfc4952b11fa3e21cbba9a92ed440...</td>\n",
       "      <td>s3://blue-pile/blue-pile-raw/5_ibm_internal/0_...</td>\n",
       "      <td>https://www.ibm.com/docs/en/zvm/7.2?topic=subc...</td>\n",
       "      <td>Making a Selective Change Suppose you want to ...</td>\n",
       "      <td>en</td>\n",
       "      <td>Making a Selective Change</td>\n",
       "      <td>ibm_internal</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>ibmdocs</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>841d761c46cde491b9642b0eca25cd164fc32da24c6c4e...</td>\n",
       "      <td>1.0.1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>ibm.com</td>\n",
       "      <td>internal</td>\n",
       "      <td>2023-07-27 05:00:00+00:00</td>\n",
       "      <td>2023-09-25 05:00:00+00:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>3e4d6c6c89dd166c88d79a6cbe3d90c8db2c9847fca198...</td>\n",
       "      <td>s3://blue-pile/blue-pile-raw/5_ibm_internal/0_...</td>\n",
       "      <td>https://www.ibm.com/docs/en/ztpf/2022?topic=me...</td>\n",
       "      <td>NKEY - NNCS NKEY0004W SINCE MAXSRT IS ZERO, ke...</td>\n",
       "      <td>en</td>\n",
       "      <td>NKEY - NNCS</td>\n",
       "      <td>ibm_internal</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>ibmdocs</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>ebfb721ab47c49c13d4b9b8268dc1aa8dd62089c1be25e...</td>\n",
       "      <td>1.0.1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>ibm.com</td>\n",
       "      <td>internal</td>\n",
       "      <td>2023-07-27 05:00:00+00:00</td>\n",
       "      <td>2023-09-25 05:00:00+00:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>c86996cf20920d0955a38580abb650b00d0e1df5f7bd98...</td>\n",
       "      <td>s3://blue-pile/blue-pile-raw/5_ibm_internal/0_...</td>\n",
       "      <td>https://www.ibm.com/docs/en/ztpf/2021?topic=nt...</td>\n",
       "      <td>NSPA0014E REJECTED, SYSTEM IS BELOW 1052 STATE...</td>\n",
       "      <td>en</td>\n",
       "      <td>NSPA0014E</td>\n",
       "      <td>ibm_internal</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>ibmdocs</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>79c94bb45bfe357042bcf1a56ae75f6545f0ffe2d6b324...</td>\n",
       "      <td>1.0.1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>ibm.com</td>\n",
       "      <td>internal</td>\n",
       "      <td>2023-07-27 05:00:00+00:00</td>\n",
       "      <td>2023-09-25 05:00:00+00:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>c86996cf20920d0955a38580abb650b00d0e1df5f7bd98...</td>\n",
       "      <td>s3://blue-pile/blue-pile-raw/5_ibm_internal/0_...</td>\n",
       "      <td>https://www.ibm.com/docs/en/ztpf/2021?topic=nt...</td>\n",
       "      <td>NSPA0014E REJECTED, SYSTEM IS BELOW 1052 STATE...</td>\n",
       "      <td>en</td>\n",
       "      <td>NSPA0014E</td>\n",
       "      <td>ibm_internal</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>ibmdocs</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>79c94bb45bfe357042bcf1a56ae75f6545f0ffe2d6b324...</td>\n",
       "      <td>1.0.1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>ibm.com</td>\n",
       "      <td>internal</td>\n",
       "      <td>2023-07-27 05:00:00+00:00</td>\n",
       "      <td>2023-09-25 05:00:00+00:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>4</td>\n",
       "      <td>3e4d6c6c89dd166c88d79a6cbe3d90c8db2c9847fca198...</td>\n",
       "      <td>s3://blue-pile/blue-pile-raw/5_ibm_internal/0_...</td>\n",
       "      <td>https://www.ibm.com/docs/en/ztpf/2022?topic=me...</td>\n",
       "      <td>NKEY - NNCS NKEY0004W SINCE MAXSRT IS ZERO, ke...</td>\n",
       "      <td>en</td>\n",
       "      <td>NKEY - NNCS</td>\n",
       "      <td>ibm_internal</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>ibmdocs</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>ebfb721ab47c49c13d4b9b8268dc1aa8dd62089c1be25e...</td>\n",
       "      <td>1.0.1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>ibm.com</td>\n",
       "      <td>internal</td>\n",
       "      <td>2023-07-27 05:00:00+00:00</td>\n",
       "      <td>2023-09-25 05:00:00+00:00</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 38 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   Unnamed: 0.1  Unnamed: 0  \\\n",
       "0             0           0   \n",
       "1             1           1   \n",
       "2             2           2   \n",
       "3             3           3   \n",
       "4             4           4   \n",
       "\n",
       "                                         document_id  \\\n",
       "0  d07817ed05f795177bfc4952b11fa3e21cbba9a92ed440...   \n",
       "1  3e4d6c6c89dd166c88d79a6cbe3d90c8db2c9847fca198...   \n",
       "2  c86996cf20920d0955a38580abb650b00d0e1df5f7bd98...   \n",
       "3  c86996cf20920d0955a38580abb650b00d0e1df5f7bd98...   \n",
       "4  3e4d6c6c89dd166c88d79a6cbe3d90c8db2c9847fca198...   \n",
       "\n",
       "                                            document  \\\n",
       "0  s3://blue-pile/blue-pile-raw/5_ibm_internal/0_...   \n",
       "1  s3://blue-pile/blue-pile-raw/5_ibm_internal/0_...   \n",
       "2  s3://blue-pile/blue-pile-raw/5_ibm_internal/0_...   \n",
       "3  s3://blue-pile/blue-pile-raw/5_ibm_internal/0_...   \n",
       "4  s3://blue-pile/blue-pile-raw/5_ibm_internal/0_...   \n",
       "\n",
       "                                               title  \\\n",
       "0  https://www.ibm.com/docs/en/zvm/7.2?topic=subc...   \n",
       "1  https://www.ibm.com/docs/en/ztpf/2022?topic=me...   \n",
       "2  https://www.ibm.com/docs/en/ztpf/2021?topic=nt...   \n",
       "3  https://www.ibm.com/docs/en/ztpf/2021?topic=nt...   \n",
       "4  https://www.ibm.com/docs/en/ztpf/2022?topic=me...   \n",
       "\n",
       "                                            contents language  \\\n",
       "0  Making a Selective Change Suppose you want to ...       en   \n",
       "1  NKEY - NNCS NKEY0004W SINCE MAXSRT IS ZERO, ke...       en   \n",
       "2  NSPA0014E REJECTED, SYSTEM IS BELOW 1052 STATE...       en   \n",
       "3  NSPA0014E REJECTED, SYSTEM IS BELOW 1052 STATE...       en   \n",
       "4  NKEY - NNCS NKEY0004W SINCE MAXSRT IS ZERO, ke...       en   \n",
       "\n",
       "                   doc_title   full_domain  filename  ...    scope tags  \\\n",
       "0  Making a Selective Change  ibm_internal       NaN  ...  ibmdocs  NaN   \n",
       "1                NKEY - NNCS  ibm_internal       NaN  ...  ibmdocs  NaN   \n",
       "2                  NSPA0014E  ibm_internal       NaN  ...  ibmdocs  NaN   \n",
       "3                  NSPA0014E  ibm_internal       NaN  ...  ibmdocs  NaN   \n",
       "4                NKEY - NNCS  ibm_internal       NaN  ...  ibmdocs  NaN   \n",
       "\n",
       "  topics                                      contents_sha2 version extras  \\\n",
       "0    NaN  841d761c46cde491b9642b0eca25cd164fc32da24c6c4e...   1.0.1    NaN   \n",
       "1    NaN  ebfb721ab47c49c13d4b9b8268dc1aa8dd62089c1be25e...   1.0.1    NaN   \n",
       "2    NaN  79c94bb45bfe357042bcf1a56ae75f6545f0ffe2d6b324...   1.0.1    NaN   \n",
       "3    NaN  79c94bb45bfe357042bcf1a56ae75f6545f0ffe2d6b324...   1.0.1    NaN   \n",
       "4    NaN  ebfb721ab47c49c13d4b9b8268dc1aa8dd62089c1be25e...   1.0.1    NaN   \n",
       "\n",
       "   dataset    domain               date_crawled            date_downloaded  \n",
       "0  ibm.com  internal  2023-07-27 05:00:00+00:00  2023-09-25 05:00:00+00:00  \n",
       "1  ibm.com  internal  2023-07-27 05:00:00+00:00  2023-09-25 05:00:00+00:00  \n",
       "2  ibm.com  internal  2023-07-27 05:00:00+00:00  2023-09-25 05:00:00+00:00  \n",
       "3  ibm.com  internal  2023-07-27 05:00:00+00:00  2023-09-25 05:00:00+00:00  \n",
       "4  ibm.com  internal  2023-07-27 05:00:00+00:00  2023-09-25 05:00:00+00:00  \n",
       "\n",
       "[5 rows x 38 columns]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import os\n",
    "import polars as pl\n",
    "input_df = pl.read_parquet(os.path.join(os.path.abspath(\"\"), \"test-data/input/sample1.parquet\"))\n",
    "input_df.to_pandas()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5305d127-10fd-4fa6-97a6-ac47db2bdc7e",
   "metadata": {},
   "source": [
    "***** print the output result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "0b2eddb9-4fb6-41eb-916c-3741b9129f2c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Unnamed: 0.1</th>\n",
       "      <th>Unnamed: 0</th>\n",
       "      <th>document_id</th>\n",
       "      <th>document</th>\n",
       "      <th>language</th>\n",
       "      <th>doc_title</th>\n",
       "      <th>full_domain</th>\n",
       "      <th>filename</th>\n",
       "      <th>download_code_version</th>\n",
       "      <th>download_config</th>\n",
       "      <th>...</th>\n",
       "      <th>tags</th>\n",
       "      <th>topics</th>\n",
       "      <th>contents_sha2</th>\n",
       "      <th>version</th>\n",
       "      <th>extras</th>\n",
       "      <th>dataset</th>\n",
       "      <th>domain</th>\n",
       "      <th>date_crawled</th>\n",
       "      <th>date_downloaded</th>\n",
       "      <th>text</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>d07817ed05f795177bfc4952b11fa3e21cbba9a92ed440...</td>\n",
       "      <td>s3://blue-pile/blue-pile-raw/5_ibm_internal/0_...</td>\n",
       "      <td>en</td>\n",
       "      <td>Making a Selective Change</td>\n",
       "      <td>ibm_internal</td>\n",
       "      <td>NaN</td>\n",
       "      <td>https://raw.github.ibm.com/ai-models-data/data...</td>\n",
       "      <td>screen -dm bash -c 'python ibm_com_download.py'</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>841d761c46cde491b9642b0eca25cd164fc32da24c6c4e...</td>\n",
       "      <td>1.0.1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>ibm.com</td>\n",
       "      <td>internal</td>\n",
       "      <td>2023-07-27 05:00:00+00:00</td>\n",
       "      <td>2023-09-25 05:00:00+00:00</td>\n",
       "      <td>https://www.ibm.com/docs/en/zvm/7.2?topic=subc...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>3e4d6c6c89dd166c88d79a6cbe3d90c8db2c9847fca198...</td>\n",
       "      <td>s3://blue-pile/blue-pile-raw/5_ibm_internal/0_...</td>\n",
       "      <td>en</td>\n",
       "      <td>NKEY - NNCS</td>\n",
       "      <td>ibm_internal</td>\n",
       "      <td>NaN</td>\n",
       "      <td>https://raw.github.ibm.com/ai-models-data/data...</td>\n",
       "      <td>screen -dm bash -c 'python ibm_com_download.py'</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>ebfb721ab47c49c13d4b9b8268dc1aa8dd62089c1be25e...</td>\n",
       "      <td>1.0.1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>ibm.com</td>\n",
       "      <td>internal</td>\n",
       "      <td>2023-07-27 05:00:00+00:00</td>\n",
       "      <td>2023-09-25 05:00:00+00:00</td>\n",
       "      <td>https://www.ibm.com/docs/en/ztpf/2022?topic=me...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>c86996cf20920d0955a38580abb650b00d0e1df5f7bd98...</td>\n",
       "      <td>s3://blue-pile/blue-pile-raw/5_ibm_internal/0_...</td>\n",
       "      <td>en</td>\n",
       "      <td>NSPA0014E</td>\n",
       "      <td>ibm_internal</td>\n",
       "      <td>NaN</td>\n",
       "      <td>https://raw.github.ibm.com/ai-models-data/data...</td>\n",
       "      <td>screen -dm bash -c 'python ibm_com_download.py'</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>79c94bb45bfe357042bcf1a56ae75f6545f0ffe2d6b324...</td>\n",
       "      <td>1.0.1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>ibm.com</td>\n",
       "      <td>internal</td>\n",
       "      <td>2023-07-27 05:00:00+00:00</td>\n",
       "      <td>2023-09-25 05:00:00+00:00</td>\n",
       "      <td>https://www.ibm.com/docs/en/ztpf/2021?topic=nt...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>c86996cf20920d0955a38580abb650b00d0e1df5f7bd98...</td>\n",
       "      <td>s3://blue-pile/blue-pile-raw/5_ibm_internal/0_...</td>\n",
       "      <td>en</td>\n",
       "      <td>NSPA0014E</td>\n",
       "      <td>ibm_internal</td>\n",
       "      <td>NaN</td>\n",
       "      <td>https://raw.github.ibm.com/ai-models-data/data...</td>\n",
       "      <td>screen -dm bash -c 'python ibm_com_download.py'</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>79c94bb45bfe357042bcf1a56ae75f6545f0ffe2d6b324...</td>\n",
       "      <td>1.0.1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>ibm.com</td>\n",
       "      <td>internal</td>\n",
       "      <td>2023-07-27 05:00:00+00:00</td>\n",
       "      <td>2023-09-25 05:00:00+00:00</td>\n",
       "      <td>https://www.ibm.com/docs/en/ztpf/2021?topic=nt...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>4</td>\n",
       "      <td>3e4d6c6c89dd166c88d79a6cbe3d90c8db2c9847fca198...</td>\n",
       "      <td>s3://blue-pile/blue-pile-raw/5_ibm_internal/0_...</td>\n",
       "      <td>en</td>\n",
       "      <td>NKEY - NNCS</td>\n",
       "      <td>ibm_internal</td>\n",
       "      <td>NaN</td>\n",
       "      <td>https://raw.github.ibm.com/ai-models-data/data...</td>\n",
       "      <td>screen -dm bash -c 'python ibm_com_download.py'</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>ebfb721ab47c49c13d4b9b8268dc1aa8dd62089c1be25e...</td>\n",
       "      <td>1.0.1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>ibm.com</td>\n",
       "      <td>internal</td>\n",
       "      <td>2023-07-27 05:00:00+00:00</td>\n",
       "      <td>2023-09-25 05:00:00+00:00</td>\n",
       "      <td>https://www.ibm.com/docs/en/ztpf/2022?topic=me...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 37 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   Unnamed: 0.1  Unnamed: 0  \\\n",
       "0             0           0   \n",
       "1             1           1   \n",
       "2             2           2   \n",
       "3             3           3   \n",
       "4             4           4   \n",
       "\n",
       "                                         document_id  \\\n",
       "0  d07817ed05f795177bfc4952b11fa3e21cbba9a92ed440...   \n",
       "1  3e4d6c6c89dd166c88d79a6cbe3d90c8db2c9847fca198...   \n",
       "2  c86996cf20920d0955a38580abb650b00d0e1df5f7bd98...   \n",
       "3  c86996cf20920d0955a38580abb650b00d0e1df5f7bd98...   \n",
       "4  3e4d6c6c89dd166c88d79a6cbe3d90c8db2c9847fca198...   \n",
       "\n",
       "                                            document language  \\\n",
       "0  s3://blue-pile/blue-pile-raw/5_ibm_internal/0_...       en   \n",
       "1  s3://blue-pile/blue-pile-raw/5_ibm_internal/0_...       en   \n",
       "2  s3://blue-pile/blue-pile-raw/5_ibm_internal/0_...       en   \n",
       "3  s3://blue-pile/blue-pile-raw/5_ibm_internal/0_...       en   \n",
       "4  s3://blue-pile/blue-pile-raw/5_ibm_internal/0_...       en   \n",
       "\n",
       "                   doc_title   full_domain  filename  \\\n",
       "0  Making a Selective Change  ibm_internal       NaN   \n",
       "1                NKEY - NNCS  ibm_internal       NaN   \n",
       "2                  NSPA0014E  ibm_internal       NaN   \n",
       "3                  NSPA0014E  ibm_internal       NaN   \n",
       "4                NKEY - NNCS  ibm_internal       NaN   \n",
       "\n",
       "                               download_code_version  \\\n",
       "0  https://raw.github.ibm.com/ai-models-data/data...   \n",
       "1  https://raw.github.ibm.com/ai-models-data/data...   \n",
       "2  https://raw.github.ibm.com/ai-models-data/data...   \n",
       "3  https://raw.github.ibm.com/ai-models-data/data...   \n",
       "4  https://raw.github.ibm.com/ai-models-data/data...   \n",
       "\n",
       "                                   download_config  ... tags topics  \\\n",
       "0  screen -dm bash -c 'python ibm_com_download.py'  ...  NaN    NaN   \n",
       "1  screen -dm bash -c 'python ibm_com_download.py'  ...  NaN    NaN   \n",
       "2  screen -dm bash -c 'python ibm_com_download.py'  ...  NaN    NaN   \n",
       "3  screen -dm bash -c 'python ibm_com_download.py'  ...  NaN    NaN   \n",
       "4  screen -dm bash -c 'python ibm_com_download.py'  ...  NaN    NaN   \n",
       "\n",
       "                                       contents_sha2 version  extras  dataset  \\\n",
       "0  841d761c46cde491b9642b0eca25cd164fc32da24c6c4e...   1.0.1     NaN  ibm.com   \n",
       "1  ebfb721ab47c49c13d4b9b8268dc1aa8dd62089c1be25e...   1.0.1     NaN  ibm.com   \n",
       "2  79c94bb45bfe357042bcf1a56ae75f6545f0ffe2d6b324...   1.0.1     NaN  ibm.com   \n",
       "3  79c94bb45bfe357042bcf1a56ae75f6545f0ffe2d6b324...   1.0.1     NaN  ibm.com   \n",
       "4  ebfb721ab47c49c13d4b9b8268dc1aa8dd62089c1be25e...   1.0.1     NaN  ibm.com   \n",
       "\n",
       "     domain               date_crawled            date_downloaded  \\\n",
       "0  internal  2023-07-27 05:00:00+00:00  2023-09-25 05:00:00+00:00   \n",
       "1  internal  2023-07-27 05:00:00+00:00  2023-09-25 05:00:00+00:00   \n",
       "2  internal  2023-07-27 05:00:00+00:00  2023-09-25 05:00:00+00:00   \n",
       "3  internal  2023-07-27 05:00:00+00:00  2023-09-25 05:00:00+00:00   \n",
       "4  internal  2023-07-27 05:00:00+00:00  2023-09-25 05:00:00+00:00   \n",
       "\n",
       "                                                text  \n",
       "0  https://www.ibm.com/docs/en/zvm/7.2?topic=subc...  \n",
       "1  https://www.ibm.com/docs/en/ztpf/2022?topic=me...  \n",
       "2  https://www.ibm.com/docs/en/ztpf/2021?topic=nt...  \n",
       "3  https://www.ibm.com/docs/en/ztpf/2021?topic=nt...  \n",
       "4  https://www.ibm.com/docs/en/ztpf/2022?topic=me...  \n",
       "\n",
       "[5 rows x 37 columns]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "output_df = pl.read_parquet(os.path.join(os.path.abspath(\"\"), \"test-data/output/sample1.parquet\"))\n",
    "output_df.to_pandas()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
